1 Load clean data

Load the cleaned data produced by the earlier steps in the data_preparation.rmd file.

koi_data <- readRDS("data/Rdas/koi_data.Rda")

2 Correlation matrix

Create a correlation matrix to understand the relationships between variables.

# Select only numeric columns for correlation
numerical_cols <- koi_data %>%
  select(
    koi_period, koi_duration, koi_depth, koi_prad, koi_teq,
    koi_insol, koi_model_snr, koi_steff, koi_slogg, koi_srad,
    koi_smass, koi_impact, koi_ror, koi_srho, koi_sma, koi_incl,
    koi_dor, koi_ldm_coeff1, koi_ldm_coeff2, koi_smet
  ) %>%
  drop_na()
# Calculate the correlation matrix
cor_matrix <- cor(numerical_cols)
# Visualize the correlation matrix
ggcorrplot(cor_matrix,
  hc.order = TRUE, # Hierarchical clustering
  type = "upper", # Show upper triangle
  lab = TRUE, # Show correlation coefficients
  lab_size = 3, # Adjust label size
  method = "circle", # Use circles to represent correlation
  colors = c("#6D9EC1", "white", "#E46726") # Color scheme: low, mid, high
)

The correlation matrix shows strong relationships between several variables. For example, the correlation between koi_period and koi_duration is 0.99, indicating a very strong positive relationship. This suggests that as the orbital period increases, the transit duration also tends to increase.
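Reading individual coefficients off a dense plot is error-prone, so it can help to list the strongest pairs directly from the matrix. The following is a minimal sketch on simulated data: the columns `a`, `b`, `c` and the helper `top_correlations()` are hypothetical and not part of the original analysis. The same helper could be applied to `cor_matrix`.

```r
# Minimal sketch: rank variable pairs by absolute correlation.
# `top_correlations()` is a hypothetical helper, not part of the original analysis.
top_correlations <- function(cm, n = 5) {
  idx <- which(upper.tri(cm), arr.ind = TRUE) # each unordered pair once
  pairs <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r = cm[idx],
    stringsAsFactors = FALSE
  )
  pairs <- pairs[order(-abs(pairs$r)), ] # strongest relationships first
  rownames(pairs) <- NULL
  head(pairs, n)
}

# Simulated data for illustration only (b is a noisy copy of a)
set.seed(1)
a <- rnorm(200)
sim <- data.frame(a = a, b = a + rnorm(200, sd = 0.1), c = rnorm(200))

top_pairs <- top_correlations(cor(sim), n = 3)
print(top_pairs)
```

Sorting by absolute value keeps strong negative correlations near the top as well, which a visual scan of the plot can miss.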

3 PCA analysis

Perform PCA on the selected numerical variables.

numerical_pca_cols <- koi_data %>%
  select(
    koi_period, koi_duration, koi_depth, koi_prad, koi_teq,
    koi_insol, koi_model_snr, koi_steff, koi_slogg, koi_srad,
    koi_smass, koi_impact, koi_ror, koi_srho, koi_sma, koi_incl,
    koi_dor, koi_ldm_coeff1, koi_ldm_coeff2, koi_smet
  )

disposition_col <- koi_data$koi_pdisposition
pca_data_complete <- numerical_pca_cols %>% drop_na()
disposition_complete <- disposition_col[complete.cases(numerical_pca_cols)]

if (length(disposition_complete) != nrow(pca_data_complete)) {
  stop("Mismatch between data rows and disposition labels after handling NAs.")
}

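# Standardize the data (zero mean, unit variance per column)
scaled_pca_data <- scale(pca_data_complete)
# The data are already standardized, so prcomp() does not center or scale again
pca_result <- prcomp(scaled_pca_data, center = FALSE, scale. = FALSE)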
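Standardizing with `scale()` first and then disabling centering and scaling in `prcomp()` should give the same components as letting `prcomp()` standardize internally. The sketch below checks this on the built-in `mtcars` data (chosen only for illustration, it is not the KOI data).

```r
# Sketch on built-in data: two equivalent ways to run PCA on standardized variables
X <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Route 1: standardize manually, then tell prcomp() to leave the data alone
pca_manual <- prcomp(scale(X), center = FALSE, scale. = FALSE)

# Route 2: let prcomp() center and scale internally
pca_builtin <- prcomp(X, center = TRUE, scale. = TRUE)

# Compare the component standard deviations of both routes
same_sdev <- isTRUE(all.equal(pca_manual$sdev, pca_builtin$sdev))
print(same_sdev)
```

Keeping the scaled matrix as a separate object, as the analysis does, is convenient when the standardized values are needed again later.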

3.1 PCA Summary

Show the proportion of variance explained by each component.

summary(pca_result)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.8409 1.7355 1.6688 1.5467 1.24685 1.12924 1.09109
## Proportion of Variance 0.1694 0.1506 0.1393 0.1196 0.07773 0.06376 0.05952
## Cumulative Proportion  0.1694 0.3200 0.4593 0.5789 0.65663 0.72039 0.77992
##                           PC8     PC9   PC10   PC11    PC12    PC13    PC14
## Standard deviation     0.9317 0.83983 0.8246 0.6914 0.66262 0.62824 0.52575
## Proportion of Variance 0.0434 0.03527 0.0340 0.0239 0.02195 0.01973 0.01382
## Cumulative Proportion  0.8233 0.85858 0.8926 0.9165 0.93844 0.95817 0.97199
##                           PC15    PC16    PC17    PC18    PC19    PC20
## Standard deviation     0.45004 0.41454 0.33987 0.23551 0.10610 0.05948
## Proportion of Variance 0.01013 0.00859 0.00578 0.00277 0.00056 0.00018
## Cumulative Proportion  0.98212 0.99071 0.99649 0.99926 0.99982 1.00000
fviz_eig(pca_result, addlabels = TRUE)

From the eigenvalues, we can see that the first two principal components explain only about 32% of the total variance, so they do not capture much of the variability in the data. The first 11 principal components are needed to exceed 90% of the variance, which suggests that the underlying structure of the data (based on these numerical variables) is quite complex: there is no simple, low-dimensional linear subspace that captures most of the information.
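The 32% and 11-component figures can be reproduced from the standard deviations printed by `summary(pca_result)`. The sketch below copies those values by hand (so it carries their rounding) and assumes a 90% threshold; with the fitted object available, `pca_result$sdev` could be used instead.

```r
# Standard deviations copied from the summary(pca_result) output
sdev <- c(
  1.8409, 1.7355, 1.6688, 1.5467, 1.24685, 1.12924, 1.09109,
  0.9317, 0.83983, 0.8246, 0.6914, 0.66262, 0.62824, 0.52575,
  0.45004, 0.41454, 0.33987, 0.23551, 0.10610, 0.05948
)

# Each component's variance is its squared standard deviation
var_explained <- sdev^2 / sum(sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components whose cumulative share reaches 90%
n_needed <- which(cum_var >= 0.90)[1]

print(round(cum_var[c(2, 10, 11)], 3))
print(n_needed)
```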

3.2 PCA Loadings

Show how the original variables contribute to each PC using the rotation matrix. The loadings tell us how much each original variable contributes to each principal component: larger absolute values mean stronger influence, and the sign (+/-) indicates the direction of the relationship between the variable and the component.

print(pca_result$rotation)
##                        PC1          PC2          PC3          PC4         PC5
## koi_period      0.30639694  0.347714363  0.274495160 -0.008366897  0.01847917
## koi_duration    0.04269443  0.231972796  0.115843053 -0.015138855 -0.07234090
## koi_depth      -0.12450017  0.127549414 -0.052461072 -0.140470095 -0.62521527
## koi_prad       -0.15286222  0.058554951  0.120548664 -0.435568096  0.11140140
## koi_teq        -0.42635722 -0.120328101  0.183977631  0.122937219  0.02213583
## koi_insol      -0.14607512 -0.123562878  0.304706726  0.046459281 -0.09016303
## koi_model_snr  -0.11924810  0.147469117 -0.048934105 -0.089975397 -0.62370781
## koi_steff      -0.26437111  0.362258812 -0.072322603  0.209410982  0.07667573
## koi_slogg       0.25605738 -0.025127674 -0.435100454 -0.115861946  0.02756831
## koi_srad       -0.17033395 -0.129612484  0.444175616  0.034644448 -0.08661357
## koi_smass      -0.29636009  0.203438378  0.260224242  0.222440536  0.09382075
## koi_impact     -0.19050425  0.118816716  0.001398144 -0.508153075  0.22162035
## koi_ror        -0.17094262  0.131493527  0.009384989 -0.548172046  0.04619732
## koi_srho        0.06774909  0.066623582  0.081594434 -0.177212392  0.11099283
## koi_sma         0.30796798  0.361668040  0.286202772  0.001518795  0.01479373
## koi_incl        0.25332897 -0.005414288  0.049045471  0.057546845 -0.23841575
## koi_dor         0.29903665  0.280085628  0.250674161 -0.041381312  0.02445484
## koi_ldm_coeff1  0.21466163 -0.409881812  0.244745270 -0.174726148 -0.08714585
## koi_ldm_coeff2 -0.16513550  0.376579150 -0.276071465  0.151030824  0.10000747
## koi_smet        0.06242363 -0.095393670  0.123653316  0.091883359  0.17810853
##                         PC6          PC7         PC8         PC9        PC10
## koi_period      0.119360611  0.051875537 -0.06019383 -0.02002505 -0.18962741
## koi_duration    0.570839165 -0.215695497  0.20280440  0.23681782  0.55006942
## koi_depth      -0.068584639 -0.079795012  0.11874165  0.06465849 -0.13349936
## koi_prad       -0.131240154 -0.047792214 -0.09561726 -0.12698835  0.32123182
## koi_teq        -0.019868440  0.150918255  0.13576536 -0.03384746 -0.22829643
## koi_insol       0.041738202  0.369541809 -0.31817638  0.64825953 -0.02700493
## koi_model_snr  -0.054104702 -0.108447734  0.11403543  0.07830158 -0.08374526
## koi_steff      -0.126839948 -0.122390073 -0.02680701  0.03044942 -0.06753539
## koi_slogg       0.006172822  0.078111944 -0.06601230  0.34949241 -0.13511190
## koi_srad        0.038663393  0.194322150 -0.15544735 -0.03160191  0.15628893
## koi_smass      -0.160901331 -0.307606379  0.06812431 -0.12983491 -0.02450837
## koi_impact      0.093935585 -0.062617207 -0.13387480  0.03102445 -0.17452222
## koi_ror        -0.050116478 -0.081455459 -0.18216056  0.02736957 -0.09446960
## koi_srho       -0.487708565  0.284586266  0.61362328  0.20829062  0.32003174
## koi_sma         0.097524343 -0.007427604 -0.09031723 -0.02579694 -0.09548265
## koi_incl       -0.460956458 -0.064815858 -0.51388871 -0.15445623  0.37733221
## koi_dor        -0.175128087  0.185140235  0.14716568 -0.01272897 -0.31841721
## koi_ldm_coeff1  0.066040003 -0.172049829  0.13495052 -0.07410840 -0.09869411
## koi_ldm_coeff2 -0.062357540  0.202944864 -0.15945606  0.12053377  0.13967629
## koi_smet       -0.281578572 -0.645526832 -0.01668499  0.51474574 -0.09385791
##                       PC11         PC12        PC13         PC14         PC15
## koi_period     -0.11820578  0.061591783 -0.01431356  0.076013208  0.459993894
## koi_duration    0.12960162  0.109252424  0.00432354  0.066515371 -0.103379379
## koi_depth      -0.04267725  0.396045760  0.54672910 -0.064810340 -0.012008518
## koi_prad       -0.73163329  0.182306235 -0.15406092 -0.005183395 -0.095637702
## koi_teq        -0.12060014  0.185245607 -0.00728660  0.303339231  0.376869420
## koi_insol       0.04266643  0.177584952 -0.20029275  0.179851481 -0.183178017
## koi_model_snr  -0.09554845 -0.510837854 -0.49530116  0.042730159  0.058988510
## koi_steff       0.10671698  0.361573273 -0.35262834 -0.486478946  0.016285554
## koi_slogg      -0.11500813  0.186729780 -0.12435748 -0.334522107  0.120528755
## koi_srad       -0.01654610 -0.328290690  0.26449586 -0.659210103  0.102248018
## koi_smass       0.15456549  0.084960880 -0.08450251  0.080795833 -0.200376338
## koi_impact      0.32277566 -0.149681992 -0.01361718 -0.031637396  0.047790992
## koi_ror         0.28721255  0.004764452  0.09098955  0.080315291  0.004165243
## koi_srho        0.20516679 -0.005572167 -0.02150910 -0.021955538  0.210846592
## koi_sma        -0.04104112  0.017431854 -0.02271675  0.016874341  0.262161956
## koi_incl        0.25593040  0.131709318 -0.05710512  0.143027074  0.077333599
## koi_dor        -0.07984545 -0.055057826  0.05838449 -0.012100698 -0.627522154
## koi_ldm_coeff1  0.06933213  0.130826127 -0.13079360 -0.004982672 -0.011208722
## koi_ldm_coeff2 -0.13521739 -0.297682322  0.30880508  0.202161085  0.004390644
## koi_smet       -0.18008029 -0.194697850  0.21481309  0.003904580  0.092571406
##                       PC16         PC17         PC18          PC19
## koi_period      0.03802923 -0.029875167 -0.021412450  6.437068e-01
## koi_duration   -0.29905083 -0.103337964  0.096615121  2.303355e-03
## koi_depth       0.16802341  0.069849268  0.112401644  1.515136e-03
## koi_prad        0.06895291  0.002454426  0.048781159  2.186547e-03
## koi_teq        -0.52158913 -0.189546320  0.179027991 -1.908513e-01
## koi_insol       0.22205127  0.087711186 -0.042019739  4.009233e-02
## koi_model_snr  -0.03694752 -0.039322810  0.004593793 -4.513809e-03
## koi_steff      -0.16381831  0.350713071 -0.051366675  1.855004e-02
## koi_slogg      -0.07264963 -0.607563323  0.095505260 -7.671468e-02
## koi_srad       -0.09304371 -0.165404261 -0.006564376  8.813710e-03
## koi_smass       0.38913201 -0.600489892  0.005196951  7.156391e-02
## koi_impact      0.11422003  0.096604457  0.653988234 -8.810851e-05
## koi_ror        -0.24061955 -0.120915818 -0.654949610 -1.554008e-03
## koi_srho        0.10595934  0.023054427 -0.010784763  1.216639e-03
## koi_sma         0.22260594  0.040643864 -0.077415692 -7.320998e-01
## koi_incl       -0.24358061 -0.075401764  0.220602020 -2.777929e-03
## koi_dor        -0.38968330 -0.050942658  0.132763452  2.187147e-03
## koi_ldm_coeff1  0.03353477 -0.044421110 -0.002719654 -7.414143e-03
## koi_ldm_coeff2  0.01801750 -0.069436056  0.011572288 -8.901582e-03
## koi_smet       -0.12982706  0.130710575  0.017994995 -5.298896e-03
##                         PC20
## koi_period      6.044490e-03
## koi_duration    5.976220e-04
## koi_depth      -1.064006e-03
## koi_prad        1.674146e-04
## koi_teq         2.041897e-03
## koi_insol      -1.814064e-03
## koi_model_snr   5.295556e-03
## koi_steff       2.245508e-01
## koi_slogg      -1.346323e-05
## koi_srad        1.387013e-02
## koi_smass      -9.313680e-03
## koi_impact      5.237360e-03
## koi_ror        -5.471374e-03
## koi_srho        3.722707e-04
## koi_sma        -5.138471e-03
## koi_incl       -4.704890e-04
## koi_dor        -2.463456e-04
## koi_ldm_coeff1  7.603596e-01
## koi_ldm_coeff2  6.073264e-01
## koi_smet       -4.634635e-02

Visualize Loadings for PC1 and PC2

fviz_pca_var(pca_result,
  col.var = "contrib", # Color by contributions
  gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
  repel = TRUE
)

Analysis of the component loadings revealed distinct patterns captured by the principal components.

  • PC1 (~17% Var): Seems to represent a contrast between orbital size/period and temperature. It has high positive loadings for koi_period, koi_sma, koi_dor (larger orbits) and high negative loadings for koi_teq (cooler temperatures associated with larger orbits). Stellar properties (koi_slogg, koi_steff, koi_smass) also contribute moderately.
  • PC2 (~15% Var): Also strongly related to orbital size/period (positive loadings for koi_period, koi_sma, koi_dor) but also strongly incorporates stellar temperature (koi_steff positive loading) and limb darkening (koi_ldm_coeff1 negative, koi_ldm_coeff2 positive).
  • PC3 (~14% Var): Primarily related to stellar properties, contrasting stellar radius/insolation (koi_srad, koi_insol positive) with stellar surface gravity (koi_slogg negative). Orbital size variables also contribute moderately.
  • PC4 (~12% Var): Dominated by relative planet size and transit geometry, with high negative loadings for koi_prad, koi_ror (planet/star radius ratio), and koi_impact.
  • PC5 (~8% Var): Represents the transit signal strength, dominated by high negative loadings for koi_depth and koi_model_snr.
  • Later PCs: Capture more nuanced relationships. PC6 relates transit duration and stellar density (koi_duration, koi_srho). PC7 involves insolation and metallicity (koi_insol, koi_smet). PC19/PC20 seem to isolate specific period/axis relationships and limb darkening effects.

These interpretations suggest that the primary sources of variation in the dataset relate to the transit signal strength, stellar characteristics, transit geometry, and orbital properties.
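Interpretations like these come from picking out the largest absolute loadings in each column of the rotation matrix. A minimal sketch of that step follows; `top_loadings()` is a hypothetical helper and the PCA is fitted on the built-in `mtcars` data purely for illustration. It could be called on `pca_result$rotation` in the same way.

```r
# Hypothetical helper: the n variables with the largest absolute loading on one PC
top_loadings <- function(rotation, pc = "PC1", n = 3) {
  v <- rotation[, pc]             # named vector of loadings for this component
  v[order(-abs(v))][seq_len(n)]   # keep the sign, rank by magnitude
}

# Illustration on built-in data (not the KOI data)
pca_demo <- prcomp(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")],
  center = TRUE, scale. = TRUE
)

pc1_top <- top_loadings(pca_demo$rotation, "PC1", n = 3)
print(round(pc1_top, 3))
```

Keeping the sign in the result makes contrasts visible, such as a component that loads positively on one group of variables and negatively on another.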

3.3 PCA Plots

Combine PCA results with the disposition information and plot the results.

pca_plot_data <- data.frame(
  PC1 = pca_result$x[, 1],
  PC2 = pca_result$x[, 2],
  Disposition = disposition_complete
)

autoplot(pca_result,
  data = data.frame(pca_data_complete, Disposition = disposition_complete), colour = "Disposition",
  loadings = TRUE, loadings.colour = "blue",
  loadings.label = TRUE, loadings.label.size = 3
) +
  labs(title = "PCA Plot with Loadings") +
  theme_minimal()

fviz_pca_ind(pca_result,
  geom.ind = "point", # show points only (but can use "text")
  col.ind = disposition_complete, # color by groups
  palette = "jco", # Journal color palette
  addEllipses = TRUE, # Concentration ellipses
  legend.title = "Disposition"
) +
  ggtitle("PCA Plot of Individuals")

pca_scores_df_7 <- data.frame(pca_result$x[, 1:7], Disposition = disposition_complete)

ggpairs(pca_scores_df_7,
  columns = 1:7, # Specify columns for the PC dimensions
  aes(color = Disposition, alpha = 0.6), # Map color and transparency to Disposition
  upper = list(continuous = wrap("cor", size = 3)), # Show correlation in upper panels
  lower = list(continuous = wrap("points", size = 1)), # Show scatter plots in lower panels
  diag = list(continuous = wrap("densityDiag", alpha = 0.5)), # Show density plots on diagonal
  title = "Pairs Plot Matrix of First 7 Principal Components"
) +
  theme_minimal() + # Apply a theme
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

4 Data visualization

4.1 Distribution of Dispositions

First, let’s see the balance between the different dispositions in the dataset using the pipeline disposition (koi_pdisposition).

ggplot(koi_data %>% filter(!is.na(koi_pdisposition)), aes(x = koi_pdisposition, fill = koi_pdisposition)) +
  geom_bar() +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) + # after_stat() replaces the deprecated ..count..
  labs(
    title = "Distribution of Pipeline Dispositions",
    x = "Pipeline Disposition (koi_pdisposition)",
    y = "Count"
  ) +
  theme_minimal() +
  theme(legend.position = "none") # Hide legend as fill is redundant

This plot shows the number of KOIs classified as CANDIDATE vs. FALSE POSITIVE by the Kepler pipeline in the loaded dataset (which may already reflect filtering or NA removal from earlier steps). The relative balance between these classes is important context for model building and evaluation (e.g., calculating baseline accuracy). The classes appear reasonably balanced in this dataset.
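The baseline accuracy mentioned here is the accuracy of always predicting the majority class. The sketch below computes it from class counts; the labels are hypothetical (48 vs. 52) and are not the real KOI counts. In the analysis, `koi_data$koi_pdisposition` would take the place of `labels`.

```r
# Hypothetical labels for illustration only; not the real KOI counts
labels <- c(rep("CANDIDATE", 48), rep("FALSE POSITIVE", 52))

class_counts <- table(labels)
class_share <- prop.table(class_counts) # proportion of each class

# Accuracy of a trivial model that always predicts the most common class
baseline_accuracy <- max(class_share)
majority_class <- names(class_share)[which.max(class_share)]

print(class_share)
print(baseline_accuracy)
```

Any classifier built later should beat this number to be considered useful.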

4.2 Stellar Metallicity vs. Planetary Radius

Explore if planet size relates to the host star’s metallicity.

ggplot(
  koi_data %>% filter(!is.na(koi_smet), !is.na(koi_prad), !is.na(koi_pdisposition), koi_prad > 0),
  aes(x = koi_smet, y = koi_prad, color = koi_pdisposition)
) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_y_log10(breaks = c(0.1, 0.3, 1, 3, 10, 30), labels = scales::label_number(accuracy = 0.1)) + # Planet radius often plotted on log scale
  labs(
    title = "Stellar Metallicity vs. Planetary Radius",
    x = "Stellar Metallicity [Fe/H] (koi_smet)",
    y = "Planetary Radius [Earth Radii] (koi_prad) (log scale)",
    color = "Pipeline Disposition"
  ) +
  theme_minimal() +
  annotation_logticks(sides = "l") # Add log ticks to y-axis

This plot investigates whether larger planets tend to form around stars with higher metallicity (more heavy elements). While some studies suggest such a trend, especially for gas giants, it might not be strongly apparent here without statistical analysis. We can visually inspect if CANDIDATEs (blue) and FALSE POSITIVEs (red) occupy different regions or show different trends in this parameter space. False positives might appear across the metallicity range.

4.3 Distribution of Planetary Radii

Understand the frequency of different planet sizes.

ggplot(koi_data %>% filter(!is.na(koi_prad), koi_prad > 0), aes(x = koi_prad)) +
  geom_histogram(bins = 50) + # Adjust binwidth/bins as needed
  scale_x_log10(breaks = c(0.1, 0.3, 1, 3, 10, 30, 100), labels = scales::label_number(accuracy = 0.1)) +
  labs(
    title = "Distribution of Planetary Radius",
    x = "Planetary Radius [Earth Radii] (koi_prad) (log scale)",
    y = "Count"
  ) +
  theme_minimal() +
  annotation_logticks(sides = "b") # Add log ticks to x-axis

This histogram reveals the distribution of detected planet candidate sizes. We often expect to see peaks corresponding to common planet types (like super-Earths/mini-Neptunes around 1.5-4 Earth radii) and potentially a dip known as the “radius valley” or “Fulton gap” around 1.5-2 Earth radii, separating rocky super-Earths from gaseous mini-Neptunes. The distribution is heavily influenced by detection biases (larger planets are easier to find).

4.4 Distribution of Orbital Periods

Understand the frequency of different orbital periods.

ggplot(koi_data %>% filter(!is.na(koi_period), koi_period > 0), aes(x = koi_period)) +
  geom_histogram(bins = 50) + # Adjust binwidth/bins as needed
  scale_x_log10(breaks = c(0.1, 1, 10, 100, 1000)) +
  labs(
    title = "Distribution of Orbital Periods",
    x = "Orbital Period [Days] (log scale)",
    y = "Count"
  ) +
  theme_minimal() +
  annotation_logticks(sides = "b") # Add log ticks to x-axis

This histogram shows that the vast majority of detected KOIs have short orbital periods (typically less than 50-100 days). This is largely due to detection bias: planets with shorter periods transit more frequently, making them easier to detect in the fixed duration of the Kepler mission.
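The effect of period on the number of observable transits can be illustrated with simple arithmetic: over a fixed observing baseline, roughly baseline / period transits occur. The sketch below assumes a baseline of 1460 days (about four years); that value and the example periods are assumptions for illustration, not quantities taken from the dataset.

```r
# Assumed observing baseline in days, for illustration only
baseline_days <- 1460

# Example orbital periods in days (hypothetical values)
period_days <- c(1, 10, 100, 500)

# Approximate number of transits that fit into the baseline
n_transits <- floor(baseline_days / period_days)

transit_counts <- data.frame(period_days, n_transits)
print(transit_counts)
```

A planet with a 500-day period transits only a couple of times in this window, while a 1-day planet transits more than a thousand times, which is why short periods dominate the detected sample.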

4.5 Orbital Period vs. Planetary Radius

A classic plot in exoplanet studies, often revealing distinct populations. Points are colored by disposition.

ggplot(
  koi_data %>% filter(!is.na(koi_prad), !is.na(koi_period), koi_prad > 0, koi_period > 0, !is.na(koi_pdisposition)),
  aes(x = koi_period, y = koi_prad, color = koi_pdisposition)
) +
  geom_point(alpha = 0.5, size = 1.5) + # Adjust alpha/size
  scale_x_log10(breaks = c(0.1, 1, 10, 100, 1000)) +
  scale_y_log10(breaks = c(0.1, 0.3, 1, 3, 10, 30, 100), labels = scales::label_number(accuracy = 0.1)) +
  labs(
    title = "Orbital Period vs. Planetary Radius",
    x = "Orbital Period [Days] (log scale)",
    y = "Planetary Radius [Earth Radii] (log scale)",
    color = "Pipeline Disposition" # koi_pdisposition is the pipeline disposition
  ) +
  theme_minimal() + # Or other themes
  annotation_logticks(sides = "lb") # Add log ticks to both axes

This fundamental plot shows planet radius against orbital period. We can identify known exoplanet populations: Hot Jupiters (large radius, short period, top left), potentially a “Neptunian desert” (a region with fewer Neptune-sized planets at very short periods), and the bulk of smaller planets (Super-Earths/Mini-Neptunes). Coloring by disposition helps visualize where candidates (blue) and false positives (red) lie. False positives might cluster in certain areas (e.g., very large radii suggesting eclipsing binaries) or be scattered throughout.

4.6 Insolation Flux vs. Planetary Radius

Explore potential atmospheric regimes based on stellar energy received.

ggplot(
  koi_data %>% filter(!is.na(koi_prad), !is.na(koi_insol), koi_prad > 0, koi_insol > 0, !is.na(koi_pdisposition)),
  aes(x = koi_insol, y = koi_prad, color = koi_pdisposition)
) +
  geom_point(alpha = 0.5) +
  scale_x_log10() + # Insolation often spans orders of magnitude
  scale_y_log10(breaks = c(0.1, 0.3, 1, 3, 10, 30, 100), labels = scales::label_number(accuracy = 0.1)) +
  labs(
    title = "Insolation Flux vs. Planetary Radius",
    x = "Insolation Flux [Earth Flux] (log scale)",
    y = "Planetary Radius [Earth Radii] (log scale)",
    color = "Disposition"
  ) +
  theme_minimal() +
  annotation_logticks(sides = "lb") # Add log ticks to both axes

This plot relates the amount of energy a planet receives from its star to its size. High insolation can affect planetary atmospheres (e.g., photo-evaporation potentially contributing to the radius valley). We can examine if candidates and false positives separate based on these parameters. For instance, highly irradiated large objects might be more likely to be false positives (binaries).

4.7 Transit Depth vs. SNR

Explore the relationship between the measured transit depth and its signal-to-noise ratio.

ggplot(
  koi_data %>% filter(!is.na(koi_depth), !is.na(koi_model_snr), koi_depth > 0, koi_model_snr > 0, !is.na(koi_pdisposition)),
  aes(x = koi_depth, y = koi_model_snr, color = koi_pdisposition)
) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  scale_y_log10() +
  labs(
    title = "Transit Depth vs. Model Signal-to-Noise Ratio",
    x = "Transit Depth [ppm] (log scale)",
    y = "Transit Signal-to-Noise Ratio (log scale)",
    color = "Pipeline Disposition"
  ) +
  theme_minimal() +
  annotation_logticks(sides = "lb")

As expected, there is a strong positive correlation between transit depth and SNR – deeper transits are easier to detect with higher confidence. This plot helps visualize if false positives tend to cluster at lower SNRs or specific depths. Some false positives might have high SNR but other characteristics (like V-shaped transits, not shown here) that disqualify them. Candidates span a wide range of depths and SNRs.

4.8 Boxplots of Key Variables by Disposition

Compare the distributions of important numeric variables between CANDIDATEs and FALSE POSITIVEs.

# Example: Orbital Period
p1 <- ggplot(
  koi_data %>% filter(!is.na(koi_period), !is.na(koi_pdisposition), koi_period > 0),
  aes(x = koi_pdisposition, y = koi_period, fill = koi_pdisposition)
) +
  geom_boxplot(outlier.shape = NA) + # Hide outliers for clarity on main distribution
  scale_y_log10(limits = c(NA, quantile(koi_data$koi_period, 0.99, na.rm = TRUE))) + # Cap at the 99th percentile; values above it are dropped before the box statistics are computed
  labs(y = "Orbital Period (log)", x = "Disposition") +
  theme_minimal() +
  theme(legend.position = "none")

# Example: Planetary Radius
p2 <- ggplot(
  koi_data %>% filter(!is.na(koi_prad), !is.na(koi_pdisposition), koi_prad > 0),
  aes(x = koi_pdisposition, y = koi_prad, fill = koi_pdisposition)
) +
  geom_boxplot(outlier.shape = NA) +
  scale_y_log10(limits = c(NA, quantile(koi_data$koi_prad, 0.99, na.rm = TRUE))) +
  labs(y = "Planetary Radius (log)", x = "Disposition") +
  theme_minimal() +
  theme(legend.position = "none")

# Example: Transit Duration
p3 <- ggplot(
  koi_data %>% filter(!is.na(koi_duration), !is.na(koi_pdisposition), koi_duration > 0),
  aes(x = koi_pdisposition, y = koi_duration, fill = koi_pdisposition)
) +
  geom_boxplot(outlier.shape = NA) +
  scale_y_continuous(limits = c(NA, quantile(koi_data$koi_duration, 0.99, na.rm = TRUE))) + # May not need log scale
  labs(y = "Transit Duration", x = "Disposition") +
  theme_minimal() +
  theme(legend.position = "none")

# Example: Transit SNR
p4 <- ggplot(
  koi_data %>% filter(!is.na(koi_model_snr), !is.na(koi_pdisposition), koi_model_snr > 0),
  aes(x = koi_pdisposition, y = koi_model_snr, fill = koi_pdisposition)
) +
  geom_boxplot(outlier.shape = NA) +
  scale_y_log10(limits = c(NA, quantile(koi_data$koi_model_snr, 0.99, na.rm = TRUE))) +
  labs(y = "Model SNR (log)", x = "Disposition") +
  theme_minimal() +
  theme(legend.position = "none")

# Print the plots one after another
print(p1 + labs(title = "Period Distribution"))

print(p2 + labs(title = "Radius Distribution"))

print(p3 + labs(title = "Duration Distribution"))

print(p4 + labs(title = "SNR Distribution"))

These boxplots compare the central tendency (median) and spread (interquartile range) of key variables between pipeline CANDIDATEs and FALSE POSITIVEs. Clear differences in the distributions suggest that a variable may be a good discriminator between the classes. For example, FALSE POSITIVEs might show larger median radii or shorter durations than CANDIDATEs, although overlap is expected. Variables showing clear separation are likely to be important features for predictive models. (Note: the y-axis limits are capped at the 99th percentile to focus on the bulk of the distribution. Because the cap is set on the scale, values above it are removed before the box statistics are computed, not just hidden.)
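The quantities a boxplot displays can also be computed numerically, which makes the comparison between classes easier to report. The sketch below uses simulated radii (the log-normal parameters are hypothetical, not estimates from the KOI data) and summarizes median and interquartile range per class on the log10 scale, matching the log axes used in the plots.

```r
# Simulated data for illustration only; parameters are hypothetical
set.seed(42)
sim <- data.frame(
  disposition = rep(c("CANDIDATE", "FALSE POSITIVE"), each = 500),
  radius = c(
    rlnorm(500, meanlog = log(2), sdlog = 0.5),
    rlnorm(500, meanlog = log(6), sdlog = 1.0)
  )
)

# Split log10(radius) by class, then summarize each group
by_group <- split(log10(sim$radius), sim$disposition)
group_stats <- data.frame(
  disposition = names(by_group),
  median_log10 = sapply(by_group, median), # center of each box
  iqr_log10 = sapply(by_group, IQR),       # height of each box
  row.names = NULL
)

print(group_stats)
```

A large gap between group medians relative to the interquartile ranges is the numeric counterpart of boxes that barely overlap.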